Exploratory Data Analysis - Quality and Physicochemical Properties of White Wine

Christina Lentz

The dataset analyzed within this project contains physicochemical properties and sensory output (quality rating) for a number of Portuguese vinho verde wines, specifically white variants. Chemical properties were measured by objective testing, and quality was determined by the median score of taste evaluations by at least 3 wine experts.

Here, we would like to look closely at this data, and to draw some conclusions about how the different chemical properties may or may not predict the final quality.

Univariate Plots Section

To begin with, we can look at the structure of the data set.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

The first thing we notice here is that the first variable, X, seems to just be a count, rather than an identifier. We can check this by seeing how many unique values the column contains:

## [1] 4898
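The report does not show the code behind this check, so the following is only a minimal sketch of how it could be done in R. The data frame name `wines` and the ten-row toy data are assumptions for illustration; the real data has 4898 rows.

```r
# Toy stand-in for the wine data (assumed name `wines`); only X matters here.
wines <- data.frame(X = 1:10, quality = rep(6L, 10))

# Number of distinct values in X; equals nrow(wines) when no value repeats
n_unique <- length(unique(wines$X))

# A pure row counter is exactly the sequence 1, 2, ..., n
is_counter <- identical(wines$X, seq_len(nrow(wines)))

n_unique
is_counter
```

If both checks pass, the column carries no information beyond row position.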

We can then look at the summary of our data, to confirm that the highest and lowest values of X are, indeed, 1 and 4898.

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Because this column is just a count, we can eliminate it, as it will not be useful for our calculations. We can subset the data, and then check to see that it was removed properly.

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
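One way the column could be dropped is with `subset()`. This is a sketch on a three-row toy data frame (values taken from the first rows shown above); the name `wines` is an assumption, and the author may have used a different approach.

```r
# Toy stand-in with the counter column X and two of the real columns
wines <- data.frame(
  X = 1:3,
  fixed.acidity = c(7, 6.3, 8.1),
  quality = c(6L, 6L, 6L)
)

# Keep every column except X
wines <- subset(wines, select = -X)

# Confirm the structure no longer contains X
str(wines)
```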

Another thing that we notice is that the quality is represented as an integer. This may be how we would like to represent the value in some situations, but in the end it is more of a categorical variable: it can only take integer scores from 0 to 10, and each score comes from a qualitative assessment. To handle this, we can add an ordered-factor version of quality as a new column, quality.factor, and check to see that the change worked.

## 'data.frame':    4898 obs. of  13 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ quality.factor      : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
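A minimal sketch of creating such an ordered factor follows. The quality values below are hypothetical, chosen only to span the observed range of 3 to 9; the column name `quality.factor` matches the output above.

```r
# Hypothetical quality scores covering the observed range 3-9
wines <- data.frame(quality = c(6L, 5L, 7L, 3L, 9L, 4L, 8L))

# factor(..., ordered = TRUE) sorts the distinct values and orders the levels
wines$quality.factor <- factor(wines$quality, ordered = TRUE)

# Levels are "3" < "4" < ... < "9"; the integer column is left untouched
levels(wines$quality.factor)
is.ordered(wines$quality.factor)
```

Keeping the original integer column alongside the factor means correlations and linear models can still use the numeric form.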

Lastly, I would like to put all of the ‘concentration’ variables into the same units. I feel that it would make it easier to compare the variables, and to get a sense of the relative amounts of these different components in the wine. Because most of the concentrations are in g/dm3, while free and total sulfur dioxide are reported in mg/dm3, I will convert the free and total sulfur dioxide content to g/dm3 by dividing each number by 1000.

## 'data.frame':    4898 obs. of  13 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  0.045 0.014 0.03 0.047 0.047 0.03 0.03 0.045 0.014 0.028 ...
##  $ total.sulfur.dioxide: num  0.17 0.132 0.097 0.186 0.186 0.097 0.136 0.17 0.132 0.129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ quality.factor      : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
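The unit conversion can be sketched as below, using the first three rows of the two sulfur dioxide columns as shown earlier. The data frame name `wines` and the helper vector `so2_cols` are assumptions for illustration.

```r
# First three rows of the two SO2 columns, in mg/dm3
wines <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30),
  total.sulfur.dioxide = c(170, 132, 97)
)

# Convert both columns from mg/dm3 to g/dm3 in one step
so2_cols <- c("free.sulfur.dioxide", "total.sulfur.dioxide")
wines[so2_cols] <- wines[so2_cols] / 1000

wines
```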

From here, we will look at the distribution of each of the 12 variables, beginning with the distribution of the quality itself.

We can see that the quality ratings in this data set are roughly normally distributed, with most wines in the middle of the quality scale; in fact, no wine is rated below 3 or above 9.

Now we can look at the variables that contribute to the quality. To do this, we can look at histograms of all of the different variables, to look at how they are distributed.

The first thing that we notice looking at all of the data is that nearly every variable is skewed to the right. For fixed acidity, citric acid, chlorides, free sulfur dioxide, total sulfur dioxide, and density, the skew looks to be due to a few high-valued outliers with very low counts. For these plots, we will simply cut off the top 1% of the data, to zoom in on the majority of the data.

For volatile acidity, residual sugar, and sulphates, we see a stronger skew to the right. For these, our first step will be to see what the plot looks like if we transform the x-axis to a log10 scale.
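Both adjustments can be sketched in base R. The values below are synthetic right-skewed numbers generated only for illustration, not wine measurements, and the variable names are assumptions.

```r
# Synthetic right-skewed data (NOT from the wine dataset)
set.seed(1)
x <- rlnorm(1000, meanlog = 0, sdlog = 0.5)

# Option 1: drop the top 1% so the bulk of the distribution is visible
cutoff  <- quantile(x, probs = 0.99)
trimmed <- x[x <= cutoff]

# Option 2: keep every point but view the values on a log10 scale
log_x <- log10(x)

length(trimmed)
range(log_x)
```

Passing `trimmed` or `log_x` to `hist()` would then draw the zoomed or rescaled histogram. Trimming changes which points are shown, whereas the log transform keeps all points and only changes the axis.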

pH already appears to be normally distributed, and does not need its scale adjusted. Alcohol content is not a normal distribution, but also does not look like a shape that needs to be transformed to a different scale. Initially, it almost looks like it has 2-3 levels of alcohol content, and the plot seems to step down through the levels as it progresses to the right.

Now, we can see a few new things. Our measures of acidity all appear to be normally distributed now; pH remains the same, outliers are removed for fixed acidity and citric acid, and the x-axis is rescaled to log10 for volatile acidity. This follows similarly for sources of SO2, where the outliers are removed for free sulfur dioxide and total sulfur dioxide, and the x-axis is rescaled to log10 for sulphates. The histogram for chlorides was further limited to remove the top 3% of the data, because of an excessively long tail; however, the remaining 97% of the data also looks to be normally distributed.

Residual sugar, now on a log10 scale, appears to be a bimodal distribution, with a distinctive lower and higher peak.

Density does not appear to be a normal distribution, now that we have a better view of the shape of the data. Instead, we see that the top of the distribution is much flatter than a normal distribution, and then tapers steeply on the edges.

Additional Variables

First, seeing the fact that we have a free and total sulfur dioxide number, I am interested in adding the bound sulfur dioxide number to the data frame, which is simply subtracting the free from the total. The statistical summary for this variable, and a histogram with the top and bottom 1% of data removed, is below; like the two variables it is calculated from, the distribution is slightly right skewed, but close to normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0040  0.0780  0.1000  0.1031  0.1250  0.3310
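The derived column is a simple difference. A sketch using the first three rows of the converted (g/dm3) values shown earlier; the column name `bound.sulfur.dioxide` matches the one used in the linear model later in the report.

```r
# First three rows after conversion to g/dm3
wines <- data.frame(
  free.sulfur.dioxide  = c(0.045, 0.014, 0.030),
  total.sulfur.dioxide = c(0.170, 0.132, 0.097)
)

# Bound SO2 is whatever part of the total is not free
wines$bound.sulfur.dioxide <-
  wines$total.sulfur.dioxide - wines$free.sulfur.dioxide

wines$bound.sulfur.dioxide
```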

After doing some research into winemaking, I would also like to investigate and add several other variables to the dataset.

First, I want to look at the SO2 levels in the wine. (Source: https://www.accuvin.com/wp-content/uploads/2015/04/How-SO2-and-pH-are-Linked.pdf) SO2 is produced naturally during fermentation, but it is also added afterwards, to help act as a preservative. A certain level of SO2 is desired for its antimicrobial properties, but too much SO2 gives the wine an astringent taste; thus, there is a balance that must be met.

Only a portion of the free SO2 (and none of the bound SO2) acts as a preservative, and the percentage that does depends on pH. It may be informative to look at this SO2 that acts as a preservative, called molecular SO2. Because a certain amount is needed based on pH value, there may be cases where the amount needed for antimicrobial reasons results in poorer tasting wine, and thus lower quality ratings.

From the data in the source, we see this chart for the relationship between pH and the percentage of free SO2 that is in the molecular form.

##     pH percent.molecular.SO2
## 1  3.0                  6.06
## 2  3.1                  4.88
## 3  3.2                  3.91
## 4  3.3                  3.13
## 5  3.4                  2.51
## 6  3.5                  2.00
## 7  3.6                  1.60
## 8  3.7                  1.27
## 9  3.8                  1.01
## 10 3.9                  0.81
## 11 4.0                  0.64

We can then plot this data to see the relationship between pH and molecular SO2.

This looks like an exponential-type decay, so we can look at the natural log of both variables (a log-log plot):

This is, at least visually, a nearly perfect linear fit. If we fit a linear model to the logged data, this should give us an equation which we can use to calculate the percentage of molecular SO2 from pH.

## 
## Calls:
## m1: lm(formula = log(percent.molecular.SO2) ~ log(pH))
## 
## ===============================
##   (Intercept)       10.450***  
##                     (0.156)    
##   log(pH)           -7.818***  
##                     (0.125)    
## -------------------------------
##   R-squared          0.998     
##   adj. R-squared     0.997     
##   sigma              0.038     
##   F               3930.741     
##   p                  0.000     
##   Log-likelihood    21.589     
##   Deviance           0.013     
##   AIC              -37.179     
##   BIC              -35.985     
##   N                 11         
## ===============================
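The fit can be sketched with base R's `lm()` on the eleven table rows listed above. The data frame name `so2_table` is an assumption; the formula matches the call shown in the output.

```r
# The pH / percent molecular SO2 table from the source
so2_table <- data.frame(
  pH = seq(3.0, 4.0, by = 0.1),
  percent.molecular.SO2 = c(6.06, 4.88, 3.91, 3.13, 2.51, 2.00,
                            1.60, 1.27, 1.01, 0.81, 0.64)
)

# Linear regression on the log-log scale
m1 <- lm(log(percent.molecular.SO2) ~ log(pH), data = so2_table)

# Intercept and slope of the fitted line, and the share of variance explained
coef(m1)
summary(m1)$r.squared
```

`summary(m1)` prints the standard errors and F statistic as well; the table above appears to come from a model-table formatter, which is not reproduced here.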

This gives us the following formula:

\[ \ln(\text{percent.molecular.SO}_2) = -7.818 \ln(\text{pH}) + 10.450 \]

\[ \text{percent.molecular.SO}_2 = e^{-7.818 \ln(\text{pH}) + 10.450} \]

And then to calculate the concentration of molecular SO2, we multiply this percentage by the amount of free SO2:

\[ \text{molecular.SO}_2 = \text{free.sulfur.dioxide} \times 0.01 \times e^{-7.818 \ln(\text{pH}) + 10.450} \]
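Applied to the data, the formula can be sketched as a small helper. The function name `molecular_so2` and the column name `molecular.SO2` are hypothetical; the coefficients are the fitted values reported above, and the three rows are the first rows of the dataset in g/dm3.

```r
# Hypothetical helper: molecular SO2 concentration from free SO2 and pH,
# using the fitted coefficients -7.818 and 10.450 from the model above
molecular_so2 <- function(free_so2, pH) {
  percent_molecular <- exp(-7.818 * log(pH) + 10.450)
  free_so2 * 0.01 * percent_molecular
}

# First three rows of the relevant columns (free SO2 in g/dm3)
wines <- data.frame(
  free.sulfur.dioxide = c(0.045, 0.014, 0.030),
  pH = c(3.00, 3.30, 3.26)
)

wines$molecular.SO2 <- molecular_so2(wines$free.sulfur.dioxide, wines$pH)
wines
```

Because the fit is an approximation, the percentage it returns at a given pH will differ slightly from the tabulated value.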

We can also look at the distribution of this variable:

Because of the right skew, we will also look at it with the x-axis scaled as log10.

This gives us a mostly normal distribution for molecular SO2 concentration.

Second, I want to consider the balance that winemakers are trying to find, between acidity and sulfur dioxide concentration. These two variables are very intertwined with one another. As described above, the amount of molecular sulfur dioxide available depends on the pH of the wine, which winemakers can alter by adding citric acid. More citric acid will decrease the pH, which increases the amount of molecular SO2 that is available for antimicrobial properties. Citric acid is also added for freshness, but makes the wine more microbially unstable. Sulfur dioxide is added for stability, but too much can make the wine astringent.

Because these two additives have to be balanced against one another, I think it would be interesting to look at the ratio of citric acid to sulphates.
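A sketch of the ratio, using the first three rows of the two columns; the new column name `citric.sulphate.ratio` is hypothetical, since the report does not show what the variable was called.

```r
# First three rows of the two additive columns (both in g/dm3)
wines <- data.frame(
  citric.acid = c(0.36, 0.34, 0.40),
  sulphates   = c(0.45, 0.49, 0.44)
)

# Ratio of citric acid to sulphates (dimensionless)
wines$citric.sulphate.ratio <- wines$citric.acid / wines$sulphates

wines
```

The minimum sulphates value in the summary is 0.22, so the division never hits zero; wines with zero citric acid simply get a ratio of 0.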

Again, when we plot the data and ignore the outliers, we end up with a distribution that is close to normal.

Univariate Analysis

What is the structure of your dataset?

The dataset contains 4898 wines, with 12 variables describing each one - 11 numerical measurements of physicochemical properties, and one categorical variable assessing the quality of the wine.

What is/are the main feature(s) of interest in your dataset?

The primary feature of interest is whether we are able, from the 11 numerical variables, to predict the quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

At this point, it is hard to tell which of the variables will actually have the most impact on the quality, especially given how intertwined some of the variables are.

I am interested in how the additives (citric acid, sulphates) impact the other variables in the wine. These are the only components where a winemaker can choose the specific value, whereas the other variables are a sum of all of the different growing and production choices.

I am also interested in looking at the residual sugar, as it has two distinct peaks - I am curious to see whether wines in each peak are of the same quality and relate to other variables the same way.

Did you create any new variables from existing variables in the dataset?

I created three new variables - bound sulfur dioxide, the concentration of molecular SO2, and the ratio of citric acid to sulphates.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The most interesting distributions were alcohol and residual sugar. Alcohol looks to possibly have three peaks, and residual sugar definitively has two.

In changing the data, I added a new ‘quality.factor’ variable, which stores the original integer quality as an ordered factor. Quality, despite being a number, is restricted to the integers 0-10, and each number is associated with a qualitative rating, not a numerical measurement. This makes it better suited to be a factor.

I also adjusted the units of the free and total sulfur dioxide, to be consistent with the other concentration measurements.

Within the histograms of individual variables, I cut off the top 1% of the data for the majority of the variables, in an effort to better see the shape of the remaining 99% of the data. There appear to be outliers that skew all of the data to the right. For volatile acidity, residual sugar, and sulphates, I also transformed the x-axis to a log10 scale.

Bivariate Plots Section

Assessment of How Variables Relate to Quality

To begin with, we can look at the paired interactions between the variables.

It is difficult to see the small details due to the scale of the plot, but this still shows us the most obvious trends, as well as provides correlation coefficients for each pair of variables. From these plots, we see several weak correlations with quality, and no strong ones. For the sake of having a cutoff, I will look more closely at any correlation coefficient with an absolute value above 0.1:

Alcohol: 0.436
Density: -0.307
Bound Sulfur Dioxide: -0.218
Chlorides: -0.201
Volatile Acidity: -0.195
Total Sulfur Dioxide: -0.175
Fixed Acidity: -0.114
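The screening step above can be sketched with `cor()`. The data here are synthetic, generated only so the example runs on its own; the resulting coefficients are not the ones listed above, and the data frame name `wines` is an assumption.

```r
# Synthetic stand-in data (NOT the wine dataset)
set.seed(42)
n <- 200
alcohol <- runif(n, 8, 14)
density <- 1.03 - 0.003 * alcohol + rnorm(n, sd = 0.002)
quality <- round(pmin(pmax(2 + 0.35 * alcohol + rnorm(n), 3), 9))
wines <- data.frame(alcohol, density, quality)

# Pearson correlation of each predictor with quality, as a named vector
r <- cor(wines[, c("alcohol", "density")], wines$quality)[, 1]

# Keep only coefficients whose absolute value exceeds the 0.1 cutoff,
# ordered from strongest to weakest
selected <- r[abs(r) > 0.1]
selected[order(-abs(selected))]
```

Filtering on `abs(r)` matters because most of the correlations of interest are negative.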

  1. Alcohol and Quality

From the plots, we can see several things that agree with the positive correlation coefficient. First, a slight upward trend is visible in the quality vs. alcohol content scatter plot.

Second, looking at the box plot for alcohol as a function of quality (as a factor), we can observe the median value of alcohol rising as we increase from a quality of 5 to 9.

Lastly, when we look at the histogram for alcohol, with the bars colored to represent the quality, we can see that as we travel to higher alcohol content, we see more of the higher quality colors.

Overall, each of these three plots supports a relationship between quality and alcohol content.

One thing to consider here is whether we can better visualize the trends if we put the quality numbers into larger ‘buckets’. If we look at a quality rating of 9, for example, our box plot shows us an excessively narrow IQR, which is likely indicative of a very small sample size, as we know that most of the wines fall at a quality level of 5 or 6. To look at the data grouped into these larger buckets, we can cut it into High (>6), Medium (5,6), and Low (<5) regions. These numbers were chosen based on the IQR for the quality variable, which contains all of the wines of quality 5 or 6.

This helps make the trends in the box plot and histogram more obvious.
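The bucketing described above can be sketched with `cut()`. The vector below simply lists each observed quality score once, and the name `quality.bucket` is hypothetical.

```r
# One example of each quality score observed in the data
quality <- c(3L, 4L, 5L, 6L, 7L, 8L, 9L)

# Intervals are right-closed by default: (-Inf, 4], (4, 6], (6, Inf)
quality.bucket <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("Low", "Medium", "High"),
  ordered_result = TRUE
)

data.frame(quality, quality.bucket)
```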

We see similar results to this when we look at the density, bound sulfur dioxide, chlorides, and volatile acidity. Like above, we see a consistent reflection of the reported correlation coefficient, in each of these cases negative. The scatter plot has a downward slope, the medians and IQRs on the box plot are dropping as we increase in quality, and the histogram shows noticeably more of the higher quality ratings towards the lower end of the x-axis.

  1. Density and Quality

  1. Bound SO2 and Quality

  1. Chlorides and Quality

  1. Volatile Acidity and Quality

For the last two variables on the list, total sulfur dioxide and fixed acidity, we don’t see much of a relationship between the variable and quality on any of the three plots. In the scatter plots, there is no noticeable trend amongst the points, and this is reflected in the fact that there is not a consistent rise or fall in the median and IQR across quality levels. Additionally, the amount of higher and lower quality wines follows the general trend of the histogram for both.

  1. Total Sulfur Dioxide and Quality

  1. Fixed Acidity and Quality

In addition to looking at how the variables with the highest correlation coefficients are related to quality, I would also like to look more closely at the variables that I created - Molecular SO2 and Citric Acid to Sulphate ratio. I am also interested to look independently at citric acid and sulphates, as I mentioned above, because they are both independently added by the winemaker.

  1. Molecular SO2 and Quality

  1. Citric Acid to Sulphates Ratio and Quality

  1. Citric Acid and Quality

  1. Sulphates and Quality

In all four cases, we see no obvious slope to the scatter plot, no consistent trend amongst the medians and IQRs on the box plot, and all of the quality levels seem to trend together. Despite what I may have been hoping to see, it doesn’t seem that the sulphates and citric acid are good predictors of quality, nor is the ratio between the two. Likewise, the molecular SO2 does not seem to correlate with quality.

Lastly, I would like to look separately at how residual sugar relates to quality - because of the two peaks in the data, there may be a relationship with quality that is hidden in the data.

  1. Residual Sugar and Quality

Looking at the plots for residual sugar, we really do not see a relationship between residual sugar and quality. There are more high quality wines at low residual sugar, but there are also more low quality wines at low residual sugar! The points are concentrated in the region below a residual sugar of ~3, and are more diffuse above that. We see from the histogram that the high quality wines seem to trend with the general shape of the distribution.

Assessment of Other Interactions Between Variables

From here, we would also like to look at how other variables, other than quality, relate to one another, to help us get a larger picture of the dataset.

To start this analysis, I would like to look primarily at how the other variables are related to the alcohol content, as it has the strongest correlation with quality. The variables most correlated with alcohol, and their correlation coefficients, are:

Density: -0.78
Residual Sugar: -0.451
Bound SO2: -0.427
Chlorides: -0.36

Density, bound SO2, and chlorides are all also weakly correlated with quality; in fact, they are more strongly correlated with alcohol than directly with quality. Residual sugar is somewhat correlated with alcohol, even though it has less relation to quality (r = -0.0976). I also would like to consider how these variables relate to one another.

Residual Sugar and Density: 0.839
Residual Sugar and Bound SO2: 0.345
Density and Bound SO2: 0.504
Density and Chlorides: 0.257
Chlorides and Bound SO2: 0.194

With this in mind, we can look at all of the pairs of variables, beginning with the four plots involving alcohol.

  1. Alcohol vs Density

As expected with an r value of -0.78, density and alcohol show a fairly strong negative correlation.

  1. Alcohol vs Residual Sugar

There is far less obvious correlation between alcohol and residual sugar. Looking at the plot, we can see that there is a large number of points grouped near the bottom, and then the points are spread out more at the top. This actually is consistent with what we saw with the residual sugar histogram - the distribution was bimodal, with the two peaks most obvious on a log scale. To get a better look at the points at the bottom of the plot, we can look at things in this same log scale:

We now can see two distinct regions again - the higher residual sugar peak seems to have a roughly negative correlation with alcohol. The lower residual sugar peak seems to be mostly flat.

  1. Alcohol vs Chlorides

Here, like with density, we see a distinctive negative trend. What decreases the correlation coefficient here is a much larger number of outliers. When we originally looked at the histograms, we saw that for nearly all variables, some outliers in the top 1% of the data fell far outside the mostly normally distributed data. For chlorides, there were even more. The plot above has the top 3% of data removed, and it is the remainder of the data that has this fairly strong correlation with alcohol.

  1. Alcohol vs. Bound SO2

In this plot, we see a very visible, downward trend, but with quite a bit of spread in the data.

Now that we have looked at how these top few variables are related to alcohol, we also would like to look at how they relate to each other.

  1. Residual Sugar vs Density

Similar to the plot of residual sugar versus alcohol content, residual sugar versus density shows a large number of points grouped at low residual sugar values. This is the first peak in our residual sugar histogram. Interestingly, this is the highest correlation coefficient amongst the variables.

Changing our y-axis to a log scale, we see two things. First, while the low residual sugar peak does not vary with density, the wines in the high residual sugar peak have a fairly strong positive correlation with density. Second, it is interesting to note that above a certain density, we essentially only have wines in the second peak. This latter point contributes a lot to the fact that we have an r value of 0.839.

  1. Residual Sugar vs Bound SO2

Again, as we are looking at residual sugar, we would like to look at the axis on a log scale:

Looking at the two residual sugar regions on the scatter plot, we see that both have at best a very slight linear relationship with bound sulfur dioxide.

  1. Bound SO2 vs Density

Bound SO2 and density appear to be positively correlated.

  1. Chlorides vs Density

Chlorides and density appear to be positively correlated; however, it is of note here that this is the bottom 97% of the chlorides data, and quite a few high outliers are ignored here.

  1. Chlorides and Bound Sulfur Dioxide

Similar to the previous plot, the bottom 97% of the chlorides data seems to be positively correlated with bound sulfur dioxide.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality is not strongly correlated with any of the variables; the largest correlation coefficient was with alcohol, an r value of 0.436. Other than alcohol, relationships between quality and density, chlorides, bound sulfur dioxide, and volatile acidity were observed.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Residual sugar is still interesting, in that the lower peak does not necessarily have the same relationship with a given variable that the upper peak does - for instance, with the strong relationship between density and residual sugar, we see a strong linear correlation for the higher residual sugar wines, and much less for the lower residual sugar wines.

What was the strongest relationship you found?

The two strongest relationships between any variables were between density and alcohol, and density and residual sugar.

Multivariate Plots Section

To start, in order to plot third variables on our scatter plots, we need to create more “bucket” variables, which group the wine into a smaller number of buckets, and give us categorical variables where we had numerical. Here we will make them for residual sugar, alcohol, chlorides, bound SO2, and volatile acidity. For residual sugar, we use the median as the division between ‘high’ and ‘low’, since it is already naturally divided into two peaks. For the other variables, Q1, Median, and Q3 are used as division points between the levels.
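A sketch of both bucketing schemes follows, on synthetic values generated only so the example is self-contained. The bucket names and labels (`alcohol.bucket`, `sugar.bucket`, "Q1" to "Q4", "Low"/"High") are hypothetical.

```r
# Synthetic stand-in values (NOT the wine dataset)
set.seed(7)
alcohol        <- runif(100, 8, 14)
residual.sugar <- rlnorm(100, meanlog = 1.5, sdlog = 0.9)

# Four buckets split at Q1, the median, and Q3
alcohol.bucket <- cut(
  alcohol,
  breaks = quantile(alcohol, probs = c(0, 0.25, 0.5, 0.75, 1)),
  labels = c("Q1", "Q2", "Q3", "Q4"),
  include.lowest = TRUE  # without this the minimum value would become NA
)

# Two buckets split at the median
sugar.bucket <- cut(
  residual.sugar,
  breaks = c(-Inf, median(residual.sugar), Inf),
  labels = c("Low", "High")
)

table(alcohol.bucket)
table(sugar.bucket)
```

The resulting factors can be mapped to color or shape in a scatter plot.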

The first multivariate plot that I want to look at is one with alcohol, density, and residual sugar. This encompasses the two strongest interactions: alcohol/density, and density/residual sugar.

Looking at this plot, we can really visualize well how these three variables interact with one another. Alcohol and density, which was also visualized in the bivariate plot section, shows a negative correlation. The residual sugar here is split at the median of the data; we see here that there are more high residual sugar points at low alcohol and high density.

This makes sense, since sugar is fermented by yeast to make alcohol, so as alcohol is created, the sugar remaining in the wine decreases. At higher alcohol content, there are pretty much only ‘low’ residual sugar wines. At lower alcohol content, we see a range of residual sugar content. This makes sense, though, if we consider that not every wine begins with the same amount of sugar, so it is possible to have the same amount of alcohol from fermentation, but a different amount of residual sugar. Alcohol is also less dense than water, while dissolved sugar raises density, so with lower density, we consistently have more alcohol and less sugar.

From here, we can begin to add other variables to the plot. To add a fourth variable, we will convert the residual sugar category to being a varying shape, and we can then represent the fourth variable with the color. To begin with, we will look at quality, as this is the variable of primary interest in the dataset.

We see here that, other than a few outliers, the high quality data is present at low density, and high alcohol content. We also see that most of the data falls into the medium quality range, which we saw in our initial histogram.

One interesting observation is that most of the high quality outliers in the dataset seem to occur in the ‘high’ residual sugar wines, as we can see the blue triangles, quite dark in a few cases, but we do not see the same in circles.

This also highlights just how many of the wines fall into this ‘medium’ range - the plot is dominated by green points.

Next, we can look at chlorides:

Chlorides trend such that high chlorides are seen at low alcohol, and low chlorides at high alcohol, consistent with the negative correlation seen earlier (r = -0.36). Because the colors appear fairly neatly in horizontal bands, the chlorides seem to correlate with alcohol content much more than with density or residual sugar. We see that the high chloride wines, for example, have a wide range of density values, and encompass both residual sugar ranges. However, the bulk of the values are mostly limited to a small range of alcohol content.

Next, we can look at the bound sulfur dioxide:

This is an interesting plot, because it seems that the bound sulfur dioxide may be related to alcohol, density, and residual sugar. One thing to consider here is that sulfur dioxide can bind to residual sugars (Source: https://www.accuvin.com/wp-content/uploads/2015/04/How-SO2-and-pH-are-Linked.pdf). So it makes sense that when there is higher residual sugar content, we see more bound SO2. In general, we seem to see higher amounts of bound sulfur dioxide at low alcohol, high density, and high residual sugar.

Lastly, we can look at the volatile acidity. It is not correlated with density, residual sugar, or alcohol, but I am curious to see the interaction here after seeing that volatile acidity does seem to correlate loosely with quality.

Ultimately, this just shows what we knew: volatile acidity is not correlated with these other variables. Interestingly, it seems that there are more high volatile acidity wines at the high and low end of the alcohol spectrum, with lower volatile acidity in the middle.

Linear Model

Now that we have looked at the relationships between the variables, we can attempt to build a linear model from the data.
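A sequence of nested models like the one reported below could be built by adding one predictor at a time with `update()`. This is a sketch on synthetic data, so its coefficients and R-squared values are not those of the wine dataset; the data-generating numbers are arbitrary assumptions.

```r
# Synthetic stand-in data (NOT the wine dataset)
set.seed(123)
n <- 300
wines <- data.frame(
  alcohol   = runif(n, 8, 14),
  chlorides = runif(n, 0.02, 0.08)
)
wines$density <- 1.03 - 0.003 * wines$alcohol + rnorm(n, sd = 0.002)
wines$quality <- 2 + 0.35 * wines$alcohol - 5 * wines$chlorides + rnorm(n)

# Start with the strongest single predictor, then add one term per model
m1 <- lm(quality ~ alcohol, data = wines)
m2 <- update(m1, . ~ . + density)
m3 <- update(m2, . ~ . + chlorides)

# R-squared can only stay the same or rise as terms are added
r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
             function(m) summary(m)$r.squared)
r2
```

Since plain R-squared never decreases when a term is added, adjusted R-squared is the more informative number when comparing the models.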

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wines)
## m2: lm(formula = quality ~ alcohol + density, data = wines)
## m3: lm(formula = quality ~ alcohol + density + chlorides, data = wines)
## m4: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide, 
##     data = wines)
## m5: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide + 
##     log(residual.sugar), data = wines)
## 
## ==============================================================================================
##                              m1            m2            m3            m4            m5       
## ----------------------------------------------------------------------------------------------
##   (Intercept)               2.582***    -22.492***    -21.150***    -28.574***     40.641***  
##                            (0.098)       (6.165)       (6.162)       (6.437)       (9.877)    
##   alcohol                   0.313***      0.360***      0.343***      0.341***      0.270***  
##                            (0.009)       (0.015)       (0.015)       (0.015)       (0.017)    
##   density                                24.728***     23.671***     31.315***    -37.878***  
##                                          (6.079)       (6.074)       (6.369)       (9.832)    
##   chlorides                                            -2.382***     -2.245***     -1.868***  
##                                                        (0.558)       (0.558)       (0.555)    
##   bound.sulfur.dioxide                                               -1.492***     -1.470***  
##                                                                      (0.380)       (0.376)    
##   log(residual.sugar)                                                               0.196***  
##                                                                                    (0.021)    
## ----------------------------------------------------------------------------------------------
##   R-squared                 0.190         0.192         0.195         0.198         0.212     
##   adj. R-squared            0.190         0.192         0.195         0.197         0.211     
##   sigma                     0.797         0.796         0.795         0.793         0.787     
##   F                      1146.395       583.290       396.315       301.971       262.554     
##   p                         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -5839.391     -5831.127     -5822.011     -5814.298     -5772.448     
##   Deviance               3112.257      3101.773      3090.247      3080.530      3028.336     
##   AIC                   11684.782     11670.255     11654.021     11640.596     11558.896     
##   BIC                   11704.272     11696.241     11686.504     11679.576     11604.372     
##   N                      4898          4898          4898          4898          4898         
## ==============================================================================================
## 
## Calls:
## m6: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide + 
##     log(residual.sugar) + volatile.acidity, data = wines)
## m7: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide + 
##     log(residual.sugar) + volatile.acidity + fixed.acidity, data = wines)
## m8: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide + 
##     log(residual.sugar) + volatile.acidity + fixed.acidity + 
##     citric.acid, data = wines)
## m9: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide + 
##     log(residual.sugar) + volatile.acidity + fixed.acidity + 
##     citric.acid + pH, data = wines)
## m10: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide + 
##     log(residual.sugar) + volatile.acidity + fixed.acidity + 
##     citric.acid + pH + sulphates, data = wines)
## 
## ==============================================================================================
##                              m6            m7            m8            m9           m10       
## ----------------------------------------------------------------------------------------------
##   (Intercept)              39.617***     21.945*       22.433*       40.144***     50.499***  
##                            (9.540)      (10.365)      (10.394)      (11.315)      (11.452)    
##   alcohol                   0.311***      0.331***      0.330***      0.303***      0.289***  
##                            (0.017)       (0.017)       (0.017)       (0.019)       (0.019)    
##   density                 -36.900***    -18.863       -19.349       -38.353***    -48.839***  
##                            (9.496)      (10.359)      (10.388)      (11.444)      (11.583)    
##   chlorides                -0.826        -0.904        -0.945        -0.742        -0.733     
##                            (0.539)       (0.538)       (0.542)       (0.544)       (0.542)    
##   bound.sulfur.dioxide     -0.278        -0.251        -0.265        -0.416        -0.656     
##                            (0.369)       (0.369)       (0.369)       (0.371)       (0.372)    
##   log(residual.sugar)       0.218***      0.188***      0.189***      0.230***      0.252***  
##                            (0.021)       (0.022)       (0.022)       (0.024)       (0.024)    
##   volatile.acidity         -2.093***     -2.111***     -2.099***     -2.061***     -2.029***  
##                            (0.111)       (0.111)       (0.113)       (0.113)       (0.113)    
##   fixed.acidity                          -0.061***     -0.064***     -0.027        -0.020     
##                                          (0.014)       (0.015)       (0.017)       (0.017)    
##   citric.acid                                           0.061         0.094         0.071     
##                                                        (0.096)       (0.096)       (0.096)    
##   pH                                                                  0.355***      0.327***  
##                                                                      (0.090)       (0.090)    
##   sulphates                                                                         0.522***  
##                                                                                    (0.099)    
## ----------------------------------------------------------------------------------------------
##   R-squared                 0.265         0.267         0.268         0.270         0.274     
##   adj. R-squared            0.264         0.266         0.266         0.269         0.273     
##   sigma                     0.760         0.759         0.759         0.757         0.755     
##   F                       293.447       255.097       223.232       200.732       184.455     
##   p                         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -5601.615     -5592.297     -5592.098     -5584.367     -5570.382     
##   Deviance               2824.290      2813.564      2813.336      2804.469      2788.499     
##   AIC                   11219.230     11202.593     11204.196     11190.735     11164.763     
##   BIC                   11271.202     11261.062     11269.161     11262.197     11242.722     
##   N                      4898          4898          4898          4898          4898         
## ==============================================================================================
## 
## Calls:
## m11: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide + 
##     log(residual.sugar) + volatile.acidity + fixed.acidity + 
##     citric.acid + pH + sulphates + total.sulfur.dioxide, data = wines)
## m12: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide + 
##     log(residual.sugar) + volatile.acidity + fixed.acidity + 
##     citric.acid + pH + sulphates + total.sulfur.dioxide + free.sulfur.dioxide, 
##     data = wines)
## m13: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide + 
##     log(residual.sugar) + volatile.acidity + fixed.acidity + 
##     citric.acid + pH + sulphates + total.sulfur.dioxide + free.sulfur.dioxide + 
##     molecular.SO2, data = wines)
## m14: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide + 
##     log(residual.sugar) + volatile.acidity + fixed.acidity + 
##     citric.acid + pH + sulphates + total.sulfur.dioxide + free.sulfur.dioxide + 
##     molecular.SO2 + citric.to.sulphates, data = wines)
## 
## ================================================================================
##                             m11           m12           m13           m14       
## --------------------------------------------------------------------------------
##   (Intercept)              47.757***     47.757***     49.834***     48.918***  
##                           (11.437)      (11.437)      (11.437)      (11.443)    
##   alcohol                   0.296***      0.296***      0.293***      0.290***  
##                            (0.019)       (0.019)       (0.019)       (0.019)    
##   density                 -46.226***    -46.226***    -49.791***    -49.035***  
##                           (11.567)      (11.567)      (11.593)      (11.596)    
##   chlorides                -0.818        -0.818        -0.828        -0.872     
##                            (0.541)       (0.541)       (0.540)       (0.541)    
##   bound.sulfur.dioxide     -4.426***     -4.426***      1.187         1.266     
##                            (0.839)       (0.839)       (1.751)       (1.751)    
##   log(residual.sugar)       0.232***      0.232***      0.237***      0.235***  
##                            (0.025)       (0.025)       (0.025)       (0.025)    
##   volatile.acidity         -1.953***     -1.953***     -1.948***     -1.942***  
##                            (0.114)       (0.114)       (0.114)       (0.114)    
##   fixed.acidity            -0.014        -0.014        -0.008        -0.010     
##                            (0.017)       (0.017)       (0.017)       (0.017)    
##   citric.acid               0.037         0.037         0.014        -0.641     
##                            (0.096)       (0.096)       (0.096)       (0.344)    
##   pH                        0.317***      0.317***      0.770***      0.772***  
##                            (0.090)       (0.090)       (0.153)       (0.153)    
##   sulphates                 0.502***      0.502***      0.511***      0.910***  
##                            (0.099)       (0.099)       (0.098)       (0.224)    
##   total.sulfur.dioxide      3.480***      3.480***     -2.101        -2.145     
##                            (0.694)       (0.694)       (1.678)       (1.678)    
##   molecular.SO2                                       132.015***    132.566***  
##                                                       (36.155)      (36.145)    
##   citric.to.sulphates                                                 0.304*    
##                                                                      (0.153)    
## --------------------------------------------------------------------------------
##   R-squared                 0.278         0.278         0.280         0.280     
##   adj. R-squared            0.276         0.276         0.278         0.278     
##   sigma                     0.754         0.754         0.753         0.752     
##   F                       170.797       170.797       158.070       146.300     
##   p                         0.000         0.000         0.000         0.000     
##   Log-likelihood        -5557.825     -5557.825     -5551.150     -5549.185     
##   Deviance               2774.238      2774.238      2766.687      2764.468     
##   AIC                   11141.649     11141.649     11130.300     11128.369     
##   BIC                   11226.105     11226.105     11221.252     11225.818     
##   N                      4898          4898          4898          4898         
## ================================================================================

Looking at the results from the linear model, we see that the model overall is fairly poor; the R² value with all of the variables added is only 0.280.

Most variables have very little effect on the R² value, though adding volatile acidity creates the largest jump, from 0.212 (m5) to 0.265 (m6). This is interesting, because volatile acidity is less correlated with quality than alcohol, density, chlorides, or bound SO2, yet once alcohol is in the model, density, chlorides, and bound SO2 each raise R² by only 0.003 or less.
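To make the size of each step explicit, the R² values can be copied from the tables above and differenced. This is only arithmetic on the reported (rounded) values, so the gains are accurate to about 0.001.

```r
# R-squared values copied from the model tables above (m1 through m14).
r_squared <- c(m1 = 0.190, m2 = 0.192, m3 = 0.195, m4 = 0.198, m5 = 0.212,
               m6 = 0.265, m7 = 0.267, m8 = 0.268, m9 = 0.270, m10 = 0.274,
               m11 = 0.278, m12 = 0.278, m13 = 0.280, m14 = 0.280)

# Term added at each step after m1, in the order of the "Calls" listings.
added <- c("density", "chlorides", "bound.sulfur.dioxide",
           "log(residual.sugar)", "volatile.acidity", "fixed.acidity",
           "citric.acid", "pH", "sulphates", "total.sulfur.dioxide",
           "free.sulfur.dioxide", "molecular.SO2", "citric.to.sulphates")

# Gain in R-squared attributable to each added term, largest first.
gain <- round(diff(r_squared), 3)
names(gain) <- added
sort(gain, decreasing = TRUE)
```

Volatile acidity stands out with a gain of 0.053; the next largest step is log(residual.sugar) at 0.014.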

One thing that I do notice, though, is that volatile acidity is the first measure of acidity that is added to the model. If I instead introduce fixed acidity as the sixth variable, do we see the same jump?

## 
## Calls:
## m4: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide, 
##     data = wines)
## m5: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide + 
##     log(residual.sugar), data = wines)
## m6: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide + 
##     log(residual.sugar) + fixed.acidity, data = wines)
## m7: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide + 
##     log(residual.sugar) + fixed.acidity + volatile.acidity, data = wines)
## 
## ================================================================================
##                              m4            m5            m6            m7       
## --------------------------------------------------------------------------------
##   (Intercept)             -28.574***     40.641***     25.937*       21.945*    
##                            (6.437)       (9.877)      (10.737)      (10.365)    
##   alcohol                   0.341***      0.270***      0.286***      0.331***  
##                            (0.015)       (0.017)       (0.018)       (0.017)    
##   density                  31.315***    -37.878***    -22.870*      -18.863     
##                            (6.369)       (9.832)      (10.730)      (10.359)    
##   chlorides                -2.245***     -1.868***     -1.940***     -0.904     
##                            (0.558)       (0.555)       (0.554)       (0.538)    
##   bound.sulfur.dioxide     -1.492***     -1.470***     -1.457***     -0.251     
##                            (0.380)       (0.376)       (0.376)       (0.369)    
##   log(residual.sugar)                     0.196***      0.171***      0.188***  
##                                          (0.021)       (0.022)       (0.022)    
##   fixed.acidity                                        -0.051***     -0.061***  
##                                                        (0.015)       (0.014)    
##   volatile.acidity                                                   -2.111***  
##                                                                      (0.111)    
## --------------------------------------------------------------------------------
##   R-squared                 0.198         0.212         0.214         0.267     
##   adj. R-squared            0.197         0.211         0.213         0.266     
##   sigma                     0.793         0.787         0.786         0.759     
##   F                       301.971       262.554       221.298       255.097     
##   p                         0.000         0.000         0.000         0.000     
##   Log-likelihood        -5814.298     -5772.448     -5766.421     -5592.297     
##   Deviance               3080.530      3028.336      3020.892      2813.564     
##   AIC                   11640.596     11558.896     11548.843     11202.593     
##   BIC                   11679.576     11604.372     11600.815     11261.062     
##   N                      4898          4898          4898          4898         
## ================================================================================

Following the R² values again here, fixed acidity raises R² only from 0.212 to 0.214, and we see the same jump (to 0.267) once volatile acidity is added afterwards. This means that most likely, there is some relation between volatile acidity and quality that was not evident from the initial correlation coefficient.
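The order check above can be written as a small helper that measures the R² gain from adding one term to a given base model, so each variable can be tried in either position. This is a minimal sketch: `r2` and `r2_gain` are hypothetical helper names, and `toy` is synthetic data standing in for `wines`, with a deliberately short base model.

```r
# Synthetic stand-in data (illustrative values only, not the wine data).
set.seed(7)
n <- 500
toy <- data.frame(
  alcohol          = runif(n, 8, 14),
  fixed.acidity    = rnorm(n, 6.9, 0.8),
  volatile.acidity = rnorm(n, 0.28, 0.10)
)
toy$quality <- 3 + 0.3 * toy$alcohol - 2 * toy$volatile.acidity -
  0.05 * toy$fixed.acidity + rnorm(n, sd = 0.75)

# R-squared of a linear model built from a character vector of terms.
r2 <- function(terms, data, response = "quality") {
  summary(lm(reformulate(terms, response = response), data = data))$r.squared
}

# Gain in R-squared from adding `new_term` to a model with `base_terms`.
r2_gain <- function(base_terms, new_term, data) {
  r2(c(base_terms, new_term), data) - r2(base_terms, data)
}

base <- "alcohol"
gains <- c(
  volatile_first       = r2_gain(base, "volatile.acidity", toy),
  volatile_after_fixed = r2_gain(c(base, "fixed.acidity"), "volatile.acidity", toy),
  fixed_first          = r2_gain(base, "fixed.acidity", toy),
  fixed_after_volatile = r2_gain(c(base, "volatile.acidity"), "fixed.acidity", toy)
)
round(gains, 4)
```

If a variable's gain is about the same in both positions, its contribution does not depend on which other acidity measure is already in the model.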

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

This gave me a very solid understanding of the relationship between residual sugar, alcohol, and density. Putting the quality on top of those three variables gave me a good sense of how quality was affected by all of them, and how it was affected by the interactions of the three variables, rather than by one variable dominating the quality on its own.

Were there any interesting or surprising interactions between features?

I thought that adding the chlorides and bound SO2 to the alcohol/density/residual sugar plot was interesting. It helped to give me a better sense of how those two variables interacted with the other three. I also thought it was interesting that adding volatile acidity to the linear model created the largest jump. It was one of the more weakly correlated variables with quality on its own, and when plotted with the other variables that influence the quality, it didn’t show any relation.

Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created a linear model, adding each variable one at a time and observing the change in R². What I primarily found is that the model is quite weak, with an R² value of only 0.280 with all variables added in. I think that for this data set, a purely additive linear model is likely too simple. Too many of the variables depend on one another, and more interaction terms are needed to be able to properly model the results.
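As an illustration of what an interaction term looks like in practice, the sketch below compares an additive model with one that includes an interaction, using synthetic data in which the interaction is built in by construction. The variable names and coefficients are made up for the example; it does not show what would happen on the wine data.

```r
# Synthetic data with a built-in alcohol-by-sugar interaction (illustrative only).
set.seed(42)
n <- 300
toy <- data.frame(
  alcohol = runif(n, 8, 14),
  sugar   = runif(n, 1, 20)
)
toy$quality <- 2 + 0.4 * toy$alcohol + 0.3 * toy$sugar -
  0.03 * toy$alcohol * toy$sugar + rnorm(n, sd = 0.5)

# Additive model versus a model with the interaction term.
additive    <- lm(quality ~ alcohol + sugar, data = toy)
interaction <- lm(quality ~ alcohol * sugar, data = toy)  # adds alcohol:sugar

# F-test for the added term, and the R-squared of each model.
anova(additive, interaction)
c(additive    = summary(additive)$r.squared,
  interaction = summary(interaction)$r.squared)
```

In a formula, `a * b` expands to `a + b + a:b`, so the second model contains both main effects plus their product.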


Final Plots and Summary

Plot One

Description One

This plot shows the relationship between quality and the alcohol content of the wine. From the correlation coefficients, alcohol has the strongest individual relationship with the quality rating.

In this plot, we can see the positive slope of the scatter plot. In the box plot, we can see that the median alcohol content, and the IQR, are greatest for the high quality samples. In the histogram, we can see that again, more high quality samples occur with higher alcohol content.

This also helped me determine that I wanted to look at the quality in these buckets, rather than looking at each individual rating level. It helps to make the trends more obvious, and helps minimize the effects of outliers.

Plot Two

Description Two

This plot represents the relationship between alcohol, density, and residual sugar. It contains the variables that are most correlated with one another (alcohol and density; residual sugar and density), so it gives us a good sense of the interrelation between these three variables and provides a lot of insight into the data set.

Plot Three

Description Three

This plot shows us quality as it relates to alcohol content, density, and residual sugar. We see that the highest quality occurs in low density, high alcohol content wines with low residual sugar.

Plotting the quality like this also shows us how dominated the data is by medium quality wines. There are wines of medium quality spread across the ranges of both alcohol and density.


Reflection

Overall, this data is quite limited by several things.

First, we are limited by the fact that so many of the wines fall within the middle range of the data. I wonder whether all of these wines are actually very mediocre in quality, or if it is an artifact of the way that the measurement was made. In the description of the data, it mentions that the quality rating was the median of at least three opinions from wine experts. How many experts evaluated each wine on average? If most were on the lower end, like three, we are basing the quality off of a very small sample size, which may not represent the actual wine quality well. Alternatively, even a large sample size may tend to put most wines in a middling range, because we would expect a wide range of personal tastes, making it likely that each wine would be hated by some and loved by others, and ending up with a median at medium quality.
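The second possibility can be illustrated with a small simulation. It assumes, purely for the sake of argument, that each expert's score for a wine is drawn uniformly from 0 to 10 (extreme disagreement); the real scoring process is not known, so this shows only how taking a median narrows the spread of the final rating.

```r
# Hypothetical setup: every expert scores uniformly at random on 0..10.
set.seed(123)
n_wines <- 10000
score_levels <- 0:10

# Median rating per wine when k experts score each wine.
median_of <- function(k) {
  scores <- matrix(sample(score_levels, n_wines * k, replace = TRUE), ncol = k)
  apply(scores, 1, median)
}

one_expert    <- median_of(1)
three_experts <- median_of(3)
nine_experts  <- median_of(9)

# Spread of the final rating shrinks as more experts are combined.
sapply(list(one = one_expert, three = three_experts, nine = nine_experts), sd)
```

Under this assumption, the more experts contribute to the median, the more tightly the ratings cluster around the middle of the scale.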

Second, many of the variables interact with one another. Density, alcohol, and residual sugar are intertwined closely. Bound SO2 is related to residual sugar. All of the forms of SO2 are related to one another, as are all of the measures of acidity. It makes it very difficult to create a model, when all of the variables interact with multiple other variables to some degree. To get an accurate model, more of these interaction terms would need to be added to the model.

Overall, I found it difficult to decide what to plot, and what section of the data to focus on, given that there weren’t strong interactions between many of the variables. It was especially challenging that there really wasn’t any variable that was strongly correlated with our variable of interest.

The multivariate plots at the end were helpful to get a better understanding of some of the relationships between the variables that correlate the closest with quality. It was very interesting to see how the triplet of residual sugar, alcohol, and density interacted, and then to put chlorides and bound sulfur dioxide into the same plot as well, to see how they also interacted.

I found it surprising that the variables that correlated with quality actually contributed very little to the linear model: including all of the variables raised R² only from 0.190, for the model with alcohol alone, to 0.280.

For future work, I think that more interaction terms need to be explored. I looked at one ratio between variables, which did not have an impact on the model, but looking at other ratios may make sense. Additionally, using a non-linear model may be an improvement. I would also love to have more information about the quality variable, and exactly what goes into each value, so we know if the median score is representative of the whole group of experts.